Abstract - Websites backed by a database of structured data are among the
richest and densest sources of information for topical data integration.
Practical data integration requires sustainable and reliable pattern
discovery, both to retrieve content accurately and to recognize pattern
changes over time; yet the extraction of structured data from web
documents still lacks accuracy. This paper proposes the bar-tree
representation, which describes the whole pattern of a web page
efficiently on the basis of the reverse algorithm. While previous
algorithms always trace the pattern and extract the region of interest
from the top root, the reverse algorithm recognizes the pattern from the
region of interest towards both the top and bottom roots simultaneously.
The attributes are then extracted and labeled in reverse, starting from
the region of interest of the targeted contents. Since conventional
representations would require more computational power for this
algorithm, the bar-tree method is developed to represent the generated
patterns as bar graphs characterized by their depths and widths from the
document roots. We show that this representation is suitable for
extracting data from semi-structured web sources and for detecting
template changes of targeted pages. The experimental results show a
perfect recognition rate for template changes in several web targets.

Keywords — data extraction, data mining, web-based information 
system 



I. Introduction 

Text mining, especially from web sources, has become increasingly
important over the last decade. This is driven by the exponentially
growing number of websites offering various types of information. Most
of them present information generated from structured data in an
underlying database through predefined templates or layouts [1]. Given
the great number of such pages already available, these semi-structured
web sources contain rich and valuable data for a variety of purposes.
Extracting those data and rebuilding them into a structured database is
a challenge on the way to automatic data mining from web sources.

Several methods for this purpose have been proposed in the literature.
Some of them can be classified as so-called wrappers [2], [3], [4], [5],
[6], which have been briefly surveyed in [7]. The wrapper technique
allows automatic data extraction through a predefined wrapper created
for each target data source. The wrapper then accepts a query against
the data source and returns a set of structured results to the calling
application.
On the other hand, there are several automatic methods that need no
manual initial learning process. For example, some methods generate the
template automatically from the first several pages before extracting
the rest of the data based on that template [8], [9], [10]. A more
comprehensive method that does not require multiple pages has also been
proposed; it uses a page creation model which captures the main
characteristics of semi-structured web pages to derive a set of
properties [11]. This last method, however, is mainly intended to
extract lists of data records from a single web page with sibling
subtrees.

Since 2008 our group has been developing an online infrastructure whose
major purpose is to integrate information related to science and
technology across Indonesia, namely the Indonesian Scientific Index
(ISI) [12]. In contrast with the conventional approach, where data are
collected through official connections under certain regulations, ISI
integrates the data indirectly by harvesting selected web contents from
the official websites of the targeted institutions. The method was
initially motivated by the failures of some conventional methods of data
integration, which always require certain standards at every level and
lead to additional work for the participating institutions. On the other
hand, as a part of their public responsibility, all academic
institutions have developed and published various public information
through their own websites. The idea of indirect data integration is
therefore welcomed by all participating institutions: it is not imposed
from above, and it is more acceptable, much cheaper and more efficient
for all parties than the conventional approach, which requires some
standardization among the information islands belonging to separate
institutions [13], [14]. The main remaining problem is to improve the
accuracy of data retrieval and to restructure the data into the desired
fields for further content analysis.

The architecture of ISI is inspired by a combination of focused
web-crawling and regular web-harvesting. Focused web-crawling does not
indiscriminately crawl web pages like general-purpose search engines,
but attempts to download pages that are similar to each other [15]. Here
we follow the same line to harvest certain types of web pages with
specific contents. Throughout the paper we call this method focused
web-harvesting [14]. As an official data integrator, however, ISI must
not allow any errors in the data collection. Because of this requirement
of high accuracy, the first version of the focused web-harvesting
technique at ISI adopts human intervention in the initial setup: the
administrator provides the targeted URL of a list of data, for instance
a publication list, and defines the template of the final page
containing the relevant information by labeling the attributes. For the
detail page of a publication, as shown in Figs. 1 and 2, the relevant
information and labels range from the title, authors and abstract to the
full text if available. Compared with general-purpose crawling [16], or
even focused web-crawling [17], ISI can retrieve the information more
accurately owing to its targeted contents and sources. The approach has
also been realized and integrated in a user-friendly web-based interface
that enables the administrator to set up the initial parameters for each
targeted source. The toolkit has been released as open source under the
GNU Public License at SourceForge.net [18].

Fig. 1. Example of the desired RoI in the displayed content of a
scientific publication page on the web, shown by the areas inside the
dashed rectangles.

Unfortunately, in spite of its high accuracy and the fact that it needs
no machine-learning-like system, the method suffers from tedious manual
work at the initial setup to determine the region of interest (RoI) and
the tokens and to label the attributes. Moreover, it lacks the ability
to detect later changes of the targeted page templates effectively. Any
such change again requires human intervention to revise the parameters
manually, especially because the labels and tokens are represented as
DOM trees, which are sensitive to later changes of the targeted web
templates [10], [19].



On the other hand, we have also found that previously known methods,
such as wrapper induction [1] or information extraction based on
multi-pattern discovery techniques [20], are not suitable for our
purpose, since all of them by definition contain significant statistical
errors which could undermine the primary purpose of official data
integration.

In our recent work [21], a new method was introduced to overcome the
above-mentioned problems: the RoI at the final targeted web page and its
tokens are extracted in reverse, from the RoI towards both the top and
bottom roots, and each attribute is then labeled as usual. The technique
contrasts with existing mechanisms, which always start from the top root
of the targeted web page or its RoI. It has been argued that this
so-called reverse algorithm extracts the tokens more efficiently and
accurately, and detects later template changes without ambiguity. We
should also remark that the algorithm is applicable to any existing
method for data extraction, especially those which require an initial
setup by human intervention to define and label the attributes. It is
actually similar to the previous method [20], but instead of the PAT
tree [22] we use a DOM-tree-like mechanism [19].

Fig. 2. Example of the desired RoI in the web source of the scientific
publication page in Fig. 1, shown by the areas inside the dashed
rectangles.

In this paper we further present an alternative representation for
pattern discovery which is suitable for the reverse algorithm mentioned
above. The pattern is described using simple bar graphs characterized by
their widths, depths, and partial and total squares (areas).

The paper is organized as follows. After this brief introduction we
shortly review the reverse mechanism in Sec. II. In Sec. III we present
the so-called bar-tree representation and its formalism. Finally we
provide the experimental results of deploying the method to detect
template changes, before summarizing the paper and discussing some
future issues and further developments.



II. The reverse algorithm 

Whatever method is used to extract and label the tokens from a web
template or layout, a correct initial setup is crucial for further data
extraction from web sources. As mentioned before, this point plays an
important role in indirect data integration, which has no tolerance for
errors. This rules out some methods based on machine learning.

Note that from now on we discuss neither the algorithms to mine the
labeled data, since the tasks after labeling can be done using any
existing method, nor the way to find the relevant pages of a data list,
which has been discussed in our and many others' previous works [14].
The reverse mechanism can be outlined step by step as follows:



1) Determine the URL address of the final web page with the desired
information, like Fig. 1.

2) Provide the whole sentences of the RoI by copying and pasting the
'desired text' displayed on screen, not its source.

3) Provide the whole sentence(s) of each sub-RoI and assign the
attributes for each of them.

4) Crawl the source of the final web page given in step 1.

5) Parse and remove the text-format HTML tags like <b>, <i>, etc.

6) Take the upper part of the source, from the top to just before the
first sentence of the RoI. Parse it and remove all text inside, keeping
only the layout-format HTML tags like <tr>, <span>, etc. Do the same for
the lower part, that is from the end of the last sentence of the RoI to
the bottom.

7) Count the number of 'open-tags' ($n_t$) and 'closed-tags' ($n_{ct}$)
starting from the deepest part in terms of the desired content, that is
from the tags nearest to the RoI.

We should stress that the administrators do not need to provide the web
page sources at all. An open-tag here means a tag which has no pair
within the upper or lower part, while a closed-tag denotes a tag whose
pair lies within the same part. Of course, our interest is only in the
open-tags, which describe the whole structure of the web template.
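
Steps 5 to 7 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the ISI implementation: the helper names (`strip_text_tags`, `split_around_roi`, `open_tags`), the sets of text-format and void tags, and the regular-expression tokenizer are all hypothetical, and the RoI sentences are assumed to appear verbatim in the cleaned source.

```python
import re

# Assumption: a small, illustrative set of text-format tags to discard (step 5).
TEXT_TAGS = {"b", "i", "u", "em", "strong", "font"}
# Assumption: void elements never have a closing partner, so they are skipped.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def strip_text_tags(html):
    """Remove text-format tags but keep their inner text (step 5)."""
    def repl(m):
        return "" if m.group(2).lower() in TEXT_TAGS else m.group(0)
    return TAG_RE.sub(repl, html)


def split_around_roi(html, roi_first, roi_last):
    """Return the source above the first and below the last RoI sentence (step 6)."""
    start = html.index(roi_first)
    end = html.index(roi_last, start) + len(roi_last)
    return html[:start], html[end:]


def open_tags(part, upper):
    """List the unpaired layout tags of one part, nearest to the RoI first (step 7).

    In the upper part the unpaired tags are opening tags whose closing tag
    lies beyond the RoI; in the lower part they are closing tags whose
    opening tag lies above the RoI.
    """
    stack, unpaired_closing = [], []
    for m in TAG_RE.finditer(part):
        closing, name = m.group(1) == "/", m.group(2).lower()
        if name in VOID_TAGS:
            continue
        if not closing:
            stack.append(name)
        elif name in stack:
            # Pair with the most recent matching opening tag (a closed-tag).
            while stack.pop() != name:
                pass
        else:
            unpaired_closing.append(name)
    return stack[::-1] if upper else unpaired_closing


page = ("<html><body><div>banner</div><table><tr><td><b>Title</b> of the paper"
        "</td></tr><tr><td>footer</td></tr></table></body></html>")
upper, lower = split_around_roi(strip_text_tags(page), "Title", "paper")
print(open_tags(upper, upper=True))   # ['td', 'tr', 'table', 'body', 'html']
print(open_tags(lower, upper=False))  # ['td', 'tr', 'table', 'body', 'html']
```

In this toy page the banner `<div>` and the footer row are paired within their own part, so they drop out, and the remaining open-tags of the upper and lower parts mirror each other. The length of each list is the number of open-tags for that part.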

Following the above procedure, we obtain a kind of DOM tree as shown in
Fig. 3. We can further count the number of trees according to the number
of open-tags. Recall that the counting is done horizontally, from left
to right, as shown by the arrow in the figure. The number of trees in
the upper and lower parts is determined by

$$ S_{\rm upper(lower)} = n_t - n_{ct} . \tag{1} $$



Considering all possibilities for the number of trees in the upper and
lower parts, we can generally categorize web structures through the
discrepancy between both numbers,

$$ \Delta \equiv S_{\rm upper} - S_{\rm lower} , \tag{2} $$

where $\Delta = 0$ corresponds to full symmetry, while $\Delta \neq 0$
corresponds to either lower or upper asymmetry depending on its sign.

Fig. 3 provides an example of the tree for the case of Fig. 1, which
happens to be asymmetric: the numbers of trees in the upper and lower
parts are not the same, $S_{\rm upper} \neq S_{\rm lower}$. Again, we
can use one of the existing methods to calculate the number of trees,
like the PAT tree algorithm [22] and so forth.

Through the discussion above, it is clear that the present method has
several advantages:

• We can separate the structure from the rules used to obtain the RoI
and the structure inside it.

• We can find template changes and their relevance to the desired RoI,
since we can compare the pairing tokens between the upper and lower
parts.

• There is no need for further human intervention as long as the page
containing the initial RoI still exists. The system uses the same RoI as
a keyword to perform a regular check for template changes before
recrawling the same target. Only if the content has been removed by the
owner does the system defer the recrawling job at that target and send a
warning to the administrator to choose other content as a new keyword.

We discuss these points in more detail through the real implementation
at ISI in Sec. IV.

A further issue is how to represent the method not only visually but
also mathematically, to enable more quantitative analysis in a real
implementation.

III. The bar-tree representation 

Here we introduce a way to represent the reverse algorithm in the form
of bar diagrams. Each bar is characterized by its width ($w$) and
height. The height is determined by the depth ($d$) of each column of
attributes from the document root. The width, on the other hand, is
given by the number of parallel attributes ($P_d$) at a certain depth,
weighted by a ratio ($r$) according to the depth. The definition is
illustrated in Figs. 3 and 4.

Following the above definition, the width of each bar can be written as

$$ w_d = \begin{cases} l & \text{for } d = 1 \\ \dfrac{l - (d-1)\, r}{P_{d-1}} \, w_{d-1} & \text{for } d > 1 \end{cases} \tag{3} $$

where $l$ is the given initial width and $r < l/d_{\max}$ is an
appropriate ratio which decreases the width with the depth of a bar. The
number of parallel attributes $P_d$ is simply the number of attributes
at the same $d$-th depth. For instance, the 5th column in Fig. 3 has
$P_5 = 4$, while the 9th column has $P_9 = 3$, and so forth.

Following this definition, the bar-tree representation for the case of
Fig. 3 is depicted in Fig. 4. One should note the rule that for
$P_d > 1$ the $(d+1)$-th bar is drawn inside the bar of the appropriate
order at that depth. For $d = 5$ in Fig. 4, since $P_5 = 4$ and the bar
of interest is the 3rd one, the bar of $d = 6$ is put inside the 3rd bar
of $d = 5$ accordingly.

Quantitatively, the pattern of the bar-tree representation can be
characterized by its partial and total squares. The square of each
individual bar is given by

$$ A_d = d \times w_d = d \times \begin{cases} l & \text{for } d = 1 \\ \dfrac{l - (d-1)\, r}{P_{d-1}} \, w_{d-1} & \text{for } d > 1 \end{cases} \tag{4} $$

The "nett-square" of a bar, that is the part of its square which does
not overlap with its lower bars, is given by

$$ A_d^{\rm nett} = d \times \frac{l - (d-1)\, r}{P_{d-1}} \, w_{d-1} - (d-1) \times \cdots \tag{5} $$

Using the equations above, it subsequently leads to,

Finally, the total square of a pattern like Fig. 5 becomes,

Since the initial width is arbitrary and determines only the
normalization factor of the square, one can simply put $w_0 = l = 1$ in
a real implementation.

Fig. 3. The tree for the example RoI given in Fig. 1. The dashed box
denotes the RoI.

Now we have all formulae at hand and are ready to implement them to
harvest information from web sources.

IV. Implementation 

Our approach to web page information extraction has been experimentally
implemented in the ISI system. Following the wrapper induction programs,
which are usually supplemented by a user-friendly GUI, we have also
developed a web-based interface to perform the initial setup [18].

ISI can now efficiently detect changes of the targeted web templates by
comparing the old and new values of the variables defined above,
$d_{\max}$, $A_{\rm total}$ and $A_d$. This is done by executing the
check procedure each time before a new crawl of the same targeted web
pages. Once all variables and keys have been stored in the system, they
can be used not only for the initial setup but also for rechecking the
templates of the same web sources on a regular basis without human
intervention.

In most cases, a discrepancy between the old and new values of the
variable set $\{d_{\max}, A_{\rm total}\}$ is already enough to detect
template changes over time. Further discrepancies and their locations,
however, should be examined at each level of depth using $P_d$ and
$A_d$. The important point is that, once a template change has been
detected, the system can be designed to replace the old version of the
template with the new one automatically, without human intervention. If,
however, the content of the initial RoI is missing, in most cases
because it has been removed by the owner, the system can be designed to
defer the recrawling job at that target and to send a warning to the
administrator. In this case the administrator should choose other
content as a new keyword to locate the desired RoI.
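
The check procedure described above can be sketched as follows. The sketch is an illustration under stated assumptions rather than the ISI code: the `Signature` container, the `extract_signature` callable (standing for a re-run of the reverse algorithm on the freshly crawled source) and the returned action strings are hypothetical names. The comparison is exact, with no threshold.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Signature:
    """Stored pattern variables of one target (hypothetical container)."""
    d_max: int
    a_total: float
    p: Tuple[int, ...] = ()      # P_d for d = 1..d_max
    a: Tuple[float, ...] = ()    # A_d for d = 1..d_max


def check_target(source: str, roi_keyword: str, stored: Signature,
                 extract_signature: Callable[[str, str], Signature]
                 ) -> Tuple[str, Signature]:
    """Decide what to do with a target before it is recrawled."""
    if roi_keyword not in source:
        # The content of the initial RoI is gone: defer and warn the administrator.
        return "defer_and_warn", stored
    new = extract_signature(source, roi_keyword)
    # Coarse test on {d_max, A_total}, then the per-depth test on P_d and A_d.
    coarse_differs = (new.d_max, new.a_total) != (stored.d_max, stored.a_total)
    if coarse_differs or new.p != stored.p or new.a != stored.a:
        # Template change detected: the new pattern replaces the old one.
        return "template_replaced", new
    return "unchanged", stored
```

The caller would store the returned signature and recrawl only when the action is not `"defer_and_warn"`.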

We have performed an experimental run on a Linux PC (Core 2 Duo 2.6 GHz
processor, 2 GB memory) to carry out the initial extraction and to
detect later template changes, using 10,000 data items from 20 different
sources available at ISI. The data belong to five categories with
different maximum depths, $d_{\max} = (5, 10, 15, 20, 25)$.
Fig. 4. The bar-tree representation, with $w$ and $d$ denoting the width and depth of each tree depicted in Fig. 3, for $r = 10\% \times l$.



The results for two cases, representing calculations with the different
variable sets $\{d_{\max}, A_{\rm total}\}$ and
$\{d_{\max}, A_{\rm total}, P_d, A_d\}$, are shown in Figs. 6 and 7. The
time consumption in Fig. 6 is the running time in milliseconds required
for the whole process, from parsing the HTML to calculating all
variables defined above, while the accuracy rate in Fig. 7 represents
the success rate in detecting the template changes. The template changes
were made randomly but intensively among the data using a script. We
should note that the template changes are exactly the same for both
variable sets.

From the figures, we can deduce several points:

• As $d_{\max}$ grows, more time is required to calculate all variables
and the accuracy decreases slightly. The reason is obvious: trees with
deeper structure offer more possibilities and more complexity for
template changes.

• Using the more complete variable set
$\{d_{\max}, A_{\rm total}, P_d, A_d\}$ significantly improves the
accuracy of detecting template changes compared with the simpler set
$\{d_{\max}, A_{\rm total}\}$, without a significant increase in time.
This is because the variable $A_{\rm total}$ could accidentally stay the
same if the template changes occur at the same depth $d$. For instance,
if only the sequence within a depth $d$ is different, then $P_d$ remains
unaltered.

• Although the variable set $\{d_{\max}, A_{\rm total}, P_d, A_d\}$ is
enough in most applications, there is no guarantee that it correctly
detects the location of template changes. Fortunately, this is not
necessary in the reverse algorithm, because once a template change has
been detected, the new pattern is re-extracted from the RoI to replace
the old one, no matter where the change happened.
If, for certain needs, one requires more accurate detection power, which
is 100% in our experimental run, we recommend performing a further check
using the variable $\Delta$ in Eq. (2), that is the discrepancy between
$S_{\rm upper}$ and $S_{\rm lower}$. The discrepancies between the old
and new values of $\Delta$ unambiguously detect the template changes and
their exact location. This can be summarized as follows:

1) No change at all:

$$ \Delta^{\rm new} = \Delta^{\rm old} ; \quad S_{\rm upper}^{\rm new} = S_{\rm upper}^{\rm old} ; \quad S_{\rm lower}^{\rm new} = S_{\rm lower}^{\rm old} $$

2) Simultaneous changes with the same size in both upper and lower
trees:

$$ \Delta^{\rm new} = \Delta^{\rm old} ; \quad S_{\rm upper}^{\rm new} \neq S_{\rm upper}^{\rm old} ; \quad S_{\rm lower}^{\rm new} \neq S_{\rm lower}^{\rm old} $$

3) Only either the upper or the lower tree has changed:

$$ \Delta^{\rm new} \neq \Delta^{\rm old} ; \quad S_{\rm upper}^{\rm new} \neq S_{\rm upper}^{\rm old} ; \quad S_{\rm lower}^{\rm new} = S_{\rm lower}^{\rm old} $$

or:

$$ \Delta^{\rm new} \neq \Delta^{\rm old} ; \quad S_{\rm upper}^{\rm new} = S_{\rm upper}^{\rm old} ; \quad S_{\rm lower}^{\rm new} \neq S_{\rm lower}^{\rm old} $$

4) Both upper and lower trees have changed differently:

$$ \Delta^{\rm new} \neq \Delta^{\rm old} ; \quad S_{\rm upper}^{\rm new} \neq S_{\rm upper}^{\rm old} ; \quad S_{\rm lower}^{\rm new} \neq S_{\rm lower}^{\rm old} $$

Fig. 5. The total pattern of the bar-tree representation for the case of
Figs. 3 and 4.

Apparently, in case 1 there is no need to alter the stored initial
variables. In contrast, from cases 2, 3 and 4 we can deduce that the
template has changed, either in the upper tree, the lower tree or both.
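
The four cases can be told apart mechanically once the numbers of trees in the upper and lower parts are stored. The following Python function is a minimal sketch of that decision, assuming that the check variable is the difference of the two numbers, $\Delta = S_{\rm upper} - S_{\rm lower}$; the function name and the tuple layout are hypothetical and chosen only for illustration.

```python
def classify_change(old, new):
    """Classify a template change from (S_upper, S_lower) pairs.

    `old` and `new` are the stored and freshly computed numbers of trees.
    Returns the case number 1-4 of the summary above.
    """
    (old_up, old_low), (new_up, new_low) = old, new
    # Assumed reading of the check variable: Delta = S_upper - S_lower.
    delta_old, delta_new = old_up - old_low, new_up - new_low
    upper_changed, lower_changed = new_up != old_up, new_low != old_low
    if not upper_changed and not lower_changed:
        return 1  # no change at all
    if delta_new == delta_old:
        return 2  # both trees changed by the same size
    if upper_changed != lower_changed:
        return 3  # only the upper or only the lower tree changed
    return 4      # both trees changed, by different sizes


print(classify_change((5, 3), (5, 3)))  # 1
print(classify_change((5, 3), (6, 4)))  # 2
print(classify_change((5, 3), (6, 3)))  # 3
print(classify_change((5, 3), (7, 4)))  # 4
```

Testing case 2 before case 3 is safe because a change in only one tree always alters the difference. A change that leaves both numbers untouched, such as a reordering within one depth, is reported as case 1 here and has to be caught by comparing the per-depth variables instead.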

V. Conclusion 

We have discussed the bar-tree representation, which is suitable for the
reverse algorithm used to extract the RoI and to label the relevant
attributes in the initial setup. The resulting patterns can then be used
to extract data automatically from the crawled target pages. We argue
that the method, together with a suitable web interface, reduces the
administrator's work significantly while improving the accuracy and
speed of finding the tokens and labeling the attributes, because human
intervention is basically required only during the initial setup. The
pattern recognition in this method is done in an exact way, without any
predefined parameters such as threshold values.

We have found that this method is very effective in detecting template
changes, for instance newly inserted banners in the middle of the upper
or lower tree, which often occur on websites and cause difficulties for
existing methods. The important point is that it requires neither large
computing power nor further human intervention once the initial setup
has been done.

From our experimental run, we conclude that focused web-harvesting which
combines the reverse algorithm with the bar-tree representation is
appropriate for indirect data integration. The method performs the data
collection over targeted web sources very accurately. We also recommend
using the full set $\{d_{\max}, A_{\rm total}, P_d, A_d\}$ rather than
the simpler set $\{d_{\max}, A_{\rm total}\}$ to obtain results with
much higher accuracy at a moderate cost in time.

The actual work of applying the method to restructure the huge amount of
data stored at ISI is still in progress. The results and the method's
effectiveness in restructuring all relevant fields in a nationwide
scientific index will be analyzed and reported elsewhere. Finally, we
should remark that the method is also applicable to web sources in the
form of lists of data. Work on this matter is also in progress.
Fig. 6. Measured time consumption for detecting template changes for various values of $d_{\max}$. The solid and dashed lines denote
the performance with the variable sets $\{d_{\max}, A_{\rm total}, P_d, A_d\}$ and $\{d_{\max}, A_{\rm total}\}$, respectively.